12. PPO summary
So that’s it! We can finally summarize the PPO algorithm (a minimal code sketch follows the list below):
1. First, collect some trajectories based on the current policy \pi_\theta, and initialize \theta'=\theta
2. Next, compute the gradient of the clipped surrogate function using the trajectories
3. Update \theta' using gradient ascent: \theta'\leftarrow\theta' +\alpha \nabla_{\theta'}L_{\rm sur}^{\rm clip}(\theta', \theta)
4. Repeat steps 2-3 without generating new trajectories. Typically, steps 2-3 are only repeated a few times
5. Set \theta=\theta', go back to step 1, and repeat.
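To make the loop above concrete, here is a minimal PyTorch sketch of the update procedure. The policy network, the dummy trajectory tensors, and the `clipped_surrogate` helper are illustrative assumptions rather than code from this lesson; a real agent would collect states, actions, and (normalized) future rewards from an environment.

```python
# Minimal sketch of the PPO update loop (illustrative; dummy data stands in for real trajectories).
import torch
import torch.nn as nn

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective L_sur^clip(theta', theta)."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # pi_theta' / pi_theta
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)  # clip the ratio to [1-eps, 1+eps]
    # Take the pessimistic (element-wise minimum) bound, averaged over the batch
    return torch.min(ratio * advantages, clipped * advantages).mean()

# Hypothetical policy: maps a 4-dim state to probabilities over 2 actions
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for iteration in range(100):
    # Step 1: collect trajectories with pi_theta (dummy tensors used here)
    states = torch.randn(64, 4)               # stand-in for trajectory states
    actions = torch.randint(0, 2, (64,))      # stand-in for sampled actions
    advantages = torch.randn(64)              # stand-in for (normalized) future rewards

    with torch.no_grad():                     # action probabilities under the old policy pi_theta
        old_log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))

    # Steps 2-4: a few gradient-ascent updates of theta' on the same trajectories
    for _ in range(4):
        new_log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))
        loss = -clipped_surrogate(new_log_probs, old_log_probs, advantages)  # ascent = minimize the negative
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Step 5: theta <- theta' happens implicitly, since old_log_probs are recomputed next iteration
```

The inner loop of 4 updates reflects the point above that steps 2-3 are only repeated a few times before new trajectories are collected.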
The details of PPO were originally published by the team at OpenAI, and you can read their paper through this link.